-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster CPU (Arg-)Reductions #6989
Conversation
Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @manuelvogel12 thanks for this PR! Speeding up basic ops is really useful. Can I request some additional benchmarking? Essentially, we want to ensure there are no speed regressions for different use cases. An example of multiple scenarios:
number of points: 10K, 100K, 1M, 10M, 100M
number of CPU cores: 4, 8, 16, 24
[while counting cores, please ignore hyperthreading]
You can use (and update if needed) cpp/benchmarks/core/Reduction.cpp
for benchmarking this. [Build with cmake option -D BUILD_BENCHMARKS=ON
]
You can use taskset --cpu-list 0-7 open3dbenchmark...
to restrict the benchmark to only CPU cores 0-7 inclusive. [This is 4 cores for hyperthreading CPUs].
static void LaunchArgReductionParallelDim(const Indexer& indexer, | ||
func_t reduce_func, | ||
scalar_t identity) { | ||
int64_t num_output_elements = indexer.NumOutputElements(); | ||
#pragma omp parallel for schedule(static) \ | ||
num_threads(utility::EstimateMaxThreads()) | ||
for (int64_t output_idx = 0; output_idx < num_output_elements; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since usually num_output_elements << num_threads
, we should collapse the two for loops to improve thread utilization with#pragma omp for collapse(2)
I believe Visual Studio now supports this (OpenMP v3) but good to check Windows CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to remind about #6626 as this will add new OpenMP code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its possible to collapse the two loops, since the inner loop has a loop-carried dependency. What could be done is using the LaunchArgReductionKernelTwoPass
instead of the inner loop. Another option would be the use of OpenMP`s reduction clause directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When simply launching LaunchArgReductionKernelTwoPass
inside the outer for loop, it improves the performance when reducing over the large_dim, but worsens the performance of reducing over the small dim.
The image shows the time, so +4.5
means this approach would be 4.5 times slower in that benchmark,
-0.78
means roughly 4.5 times faster
Therefore, I think, when the number of output elements is smaller than the number of threads, it's best not to optimize the outer loop at all, or find a good way to calculate the number of threads for the inner loop,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @manuelvogel12 for the detailed benchmarking. This looks good to me.
To address @benjaminum's point, we will be switching from OpenMP to TBB soon (PR #6626). This helps quite a bit with lack of good support for OpenMP on Windows and Apple, simplifies the code and also comes with the benefit of composable parallelism. We also expect (some) speedup from TBB's work stealing scheduler model.
I think we can merge this PR due to its speed boost for now and then switch this code to TBB as well along with the rest of the code. @manuelvogel12 I would encourage you to check out PR #6626 and help us test that.
std::vector<int64_t> thread_results_idx(num_threads, 0); | ||
std::vector<scalar_t> thread_results_val(num_threads, identity); | ||
|
||
#pragma omp parallel for schedule(static) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this with for loop collapse in OpenMP?
This pull request aims to speed up Reductions and ArgReductions with single outputs on the CPU.
Type
Motivation and Context
For Reductions (e.g. sum), the problem of the current code is the access of multiple threads to a shared array in each loop iteration, which can be fixed using a single local variable.
Argreductions for single outputs (e.g. points.argmax()) run with the current code on a single CPU core, although the operation can be sped up in the same way as the reduction.
This PR fixes both problems and has a 400% speedup for these reductions (4x faster) and 2000% speedup for these argreductions (20x faster) on my PC.
Note that this does only apply for single-output operations performed on the CPU.
Checklist:
python util/check_style.py --apply
to apply Open3D code styleto my code.
updated accordingly.
results (e.g. screenshots or numbers) here.
Description
The benchmarking was done on a 24core cpu, using the sum() and argmax() functions of arrays with 150M elements.